THE PAV ALGORITHM OPTIMIZES BINARY PROPER SCORING RULES

NIKO BRUMMER AND JOHAN DU PREEZ

Abstract. There has been much recent interest in application of the pool-adjacent-violators
(PAV) algorithm for the purpose of calibrating the probabilistic outputs of automatic pattern
recognition and machine learning algorithms. Special cost functions, known as proper scoring
rules, form natural objective functions to judge the goodness of such calibration. We show that
for binary pattern classifiers, the non-parametric optimization of calibration, subject to a
monotonicity constraint, can be solved by PAV and that this solution is optimal for all regular
binary proper scoring rules. This extends previous results which were limited to convex binary
proper scoring rules. We further show that this result holds not only for calibration of
probabilities, but also for calibration of log-likelihood-ratios, in which case optimality holds
independently of the prior probabilities of the pattern classes.

Key words. pool-adjacent-violators algorithm, proper scoring rule, calibration, isotonic
regression

1. Introduction. There has been much recent interest in using the pool-adjacent-violators
(PAV) algorithm for the purpose of calibration of the outputs of machine learning or pattern
recognition systems [31] [7] [Ml [30] H3 US]. Our contribution is to point out and prove some
previously unpublished results concerning the optimality of using the PAV algorithm for such
calibration.

In the rest of the introduction, §1.1 defines calibration; §1.2 introduces regular binary proper
scoring rules, the class of objective functions which we use to judge the goodness of
calibration; and §1.3 gives more specific details of how this calibration problem forms the
non-parametric, monotonic optimization problem which is the subject of this paper.

The rest of the paper is organized as follows: In §2 we state the main optimization problem
under discussion; §3 summarizes previous work related to this problem; §4, the bulk of this
paper, presents our proof that PAV solves this problem; and finally §5 shows that the PAV
algorithm can be adapted to a closely related calibration problem, which has the goal of
assigning calibrated log-likelihood-ratios, rather than probabilities. We conclude in §6 with a
short discussion about applying PAV calibration in pattern recognition.

The results of this paper can be summarized as follows: The PAV algorithm, when used for
supervised, monotonic, non-parametric calibration, is (i) optimal for all regular binary proper
scoring rules and is moreover (ii) optimal at any prior when calibrating log-likelihood-ratios.

1.1. Calibration. In this paper, we are interested in the calibration of binary pattern
classification systems which are designed to discriminate between two classes, by outputting a
scalar confidence score. Let $x$ denote a to-be-classified input pattern (footnote 3), which is
known to belong to one of two classes: the target class $\theta_1$, or the non-target class
$\theta_2$. The pattern classifier under consideration performs a mapping $x \mapsto s$, where
$s$ is a real number, which we call the uncalibrated confidence score. The only assumption that
we make about $s$ is that it has the following sense: The greater the score, the more it favours
the target class, and the smaller, the more it favours the non-target class.

In order for the pattern classifier output to be more generally useful, it can be 
processed through a calibration transformation. We assume here that the calibrated 
output will be used to make a minimum-expected-cost Bayes decision p21 [29] . This 
requires that the score be transformed to act as posterior probability for the target 
class, given the score. We denote the transform of the uncalibrated score $s$ to calibrated
target posterior thus: $s \mapsto P(\theta_1|s)$. In the first (and largest) part of this paper,
we consider this calibration transformation as an atomic step and show in what sense the PAV
algorithm is optimal for this transformation.

In most machine-learning contexts, it is assumed that the object of calibration is (as discussed
above) to assign posterior probabilities [26], [31], [24]. However, the calibration of
log-likelihood-ratios may be more appropriate in some pattern recognition fields such as
automatic speaker recognition [14], [7]. This is important in particular for forensic speaker
recognition, in cases where a Bayesian framework is used to represent the weight of the speech
evidence in likelihood-ratio form [17]. With this purpose in mind, in §5 we decompose the
transformation $s \mapsto P(\theta_1|s)$ into two consecutive steps, thus:
$s \mapsto \log\frac{p(s|\theta_1)}{p(s|\theta_2)} \mapsto P(\theta_1|s)$, where the intermediate
quantity is known as the log-likelihood-ratio for the target, relative to the non-target. The
first stage, $s \mapsto \log\frac{p(s|\theta_1)}{p(s|\theta_2)}$, is now the calibration transform
and it is performed by an adapted PAV algorithm (denoted PAV-LLR), while the second stage,
$\log\frac{p(s|\theta_1)}{p(s|\theta_2)} \mapsto P(\theta_1|s)$, is just standard application of
Bayes' rule. One of the advantages of this decomposition is that the log-likelihood-ratio is
independent of $P(\theta_1)$, the prior probability for the target class, and that therefore the
pattern classifier (which does $x \mapsto s$) and the calibrator (which does
$s \mapsto \log\frac{p(s|\theta_1)}{p(s|\theta_2)}$) can both be independent of the prior. The
target prior need only be available for the final step of applying Bayes' rule. Our important
contribution here is to show that the PAV-LLR calibration is optimal independently of the prior
$P(\theta_1)$.
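
The second stage of this decomposition is simple enough to sketch in code. The following is a
minimal illustration, not taken from the paper, of Bayes' rule in log-odds form: the posterior
log-odds equal the log-likelihood-ratio plus the prior log-odds. The function name
`llr_to_posterior` and its argument names are our own assumptions.

```python
import math

def llr_to_posterior(llr, prior):
    """Map a log-likelihood-ratio to P(theta_1 | s) for a given target prior.

    Assumes 0 < prior < 1 and that llr is the natural-log likelihood ratio
    of the target class relative to the non-target class.
    """
    prior_log_odds = math.log(prior) - math.log1p(-prior)
    log_odds = llr + prior_log_odds  # Bayes' rule in log-odds form
    # Logistic sigmoid, written so that exp() never overflows.
    if log_odds >= 0.0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    z = math.exp(log_odds)
    return z / (1.0 + z)
```

The same log-likelihood-ratio can be reused with different priors: only the `prior` argument
changes, while the classifier and the calibrator that produced `llr` are left untouched.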

1.2. Regular Binary Proper Scoring Rules. We have introduced calibration 
as a tool to map uncalibrated scores to posterior probabilities, which may then be 
used to make minimum-expected-cost Bayes decisions. We next ask how the quality 
of a given calibrator may be judged. Since the stated purpose of calibration is to 
make cost-effective decisions, the goodness of calibration may indeed be judged by 
decision cost. For this purpose, we consider a class of special cost functions known as 
proper scoring rules to quantify the cost-effective decision-making ability of posterior 
probabilities, see e.g. [T51 |TJ] [J31 HH (SI US] , or our previous work [7J. Since this paper 
is focused on the PAV algorithm, a detailed introduction to proper scoring rules is 
out of scope. Here we just need to define the class of regular binary proper scoring 
rules in a way that is convenient to our purposes. (Appendix [A] gives some notes to 
link this definition to previous work.) 

We define a regular binary proper scoring rule (RBPSR) to be a function,
$C_\rho : \{\theta_1, \theta_2\} \times [0,1] \to [0,\infty]$, such that

$$C_\rho(\theta_1, q) = \int_q^1 \frac{1}{\eta}\,\rho(\eta)\,d\eta, \qquad
  C_\rho(\theta_2, q) = \int_0^q \frac{1}{1-\eta}\,\rho(\eta)\,d\eta, \tag{1}$$

for which the following conditions must hold:

(i) These integrals exist and are finite, except (footnote 4) possibly for $C_\rho(\theta_1, 0)$
and $C_\rho(\theta_2, 1)$, which may assume the value $\infty$.

(ii) $\rho(\eta)$ is a probability distribution (footnote 5) over $[0,1]$, i.e.
$\rho(\eta) \ge 0$ for $0 \le \eta \le 1$, and $\int_0^1 \rho(\eta)\,d\eta = 1$.

3 The nature of x is unimportant here; it can be an image, a sound recording, a text document
etc.

In other words the RBPSR's are a family of functions parametrized by $\rho$. If
$\rho(\eta) > 0$ almost everywhere, then the RBPSR is denoted strict, otherwise it is
non-strict. We list some examples, which will be relevant later:

1. If $\rho(\eta) = \delta(\eta - \eta')$, where $\delta$ denotes Dirac-delta, then
$C_\rho(\cdot, q)$ represents the misclassification cost of making binary decisions by comparing
probability $q$ to a threshold of $\eta'$. Note that this proper scoring rule is non-strict.
Moreover it is discontinuous and therefore not convex as a function of $q$. This is but one
example of many non-convex proper scoring rules. A more general example is obtained by convex
combination (footnote 6) of multiple Dirac-deltas:
$\rho(\eta) = \sum_i \alpha_i\,\delta(\eta - \eta_i)$.

2. If $\rho(\eta) = 6\eta(1-\eta)$, then $C_\rho$ is the (strict) quadratic (footnote 7) proper
scoring rule, also known as the Brier scoring rule [6].

3. If $\rho(\eta) = 1$, then $C_\rho$ is the (strict) logarithmic scoring rule, originally
proposed by [15].

The salient property of a binary proper scoring rule is that for any $0 \le p, q \le 1$, its
expectation w.r.t. $q$ is minimized at $q$, so that:
$q\,C_\rho(\theta_1, q) + (1-q)\,C_\rho(\theta_2, q) \le q\,C_\rho(\theta_1, p) + (1-q)\,C_\rho(\theta_2, p)$.
For a strict RBPSR, this minimum is unique. We show below in lemma 6 how this property derives
from (1).
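
The three example rules and their salient property can be checked numerically. The sketch below
is our own illustration: the closed forms for the logarithmic and quadratic rules follow from
evaluating the defining integrals, while the function names, the threshold value 0.305 and the
tie convention at the threshold are assumptions made only for this example.

```python
import math

def log_rule(q):
    """rho(eta) = 1: returns (C(theta_1, q), C(theta_2, q)) for 0 < q < 1."""
    return -math.log(q), -math.log(1.0 - q)

def brier_rule(q):
    """rho(eta) = 6*eta*(1 - eta): the integrals evaluate to 3(1-q)^2 and 3q^2."""
    return 3.0 * (1.0 - q) ** 2, 3.0 * q ** 2

def threshold_rule(q, eta0=0.305):
    """rho(eta) = delta(eta - eta0): cost of deciding by thresholding q at eta0.

    The value exactly at q == eta0 is a convention (assumed here: decide target).
    """
    miss = 1.0 / eta0 if q < eta0 else 0.0
    false_alarm = 1.0 / (1.0 - eta0) if q >= eta0 else 0.0
    return miss, false_alarm

def expected_cost(rule, r, q):
    """Expectation of the rule at q when the target class has probability r."""
    c1, c2 = rule(q)
    return r * c1 + (1.0 - r) * c2

grid = [i / 100.0 for i in range(1, 100)]
r = 0.7
for rule in (log_rule, brier_rule, threshold_rule):
    best = min(expected_cost(rule, r, q) for q in grid)
    # Propriety: reporting q = r is never worse than reporting any other q.
    assert expected_cost(rule, r, r) <= best + 1e-12
```

For the two strict rules the minimum over the grid is attained only at `q = r`; for the
threshold rule every `q` on the same side of the threshold as `r` gives the same expected cost,
which is what non-strict means here.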

1.3. Supervised, monotonic, non-parametric calibration. We have thus 
far established that we want to find a calibration method to map scores to probabilities 
and that we then want to judge the goodness of these probabilities via RBPSR. We 
can now be more specific about the calibration problem that is optimally solvable by 
PAV: 

1. Firstly, we constrain the calibration transformation $s \mapsto P(\theta_1|s)$ to be a
monotonic non-decreasing function: $\mathbb{R} \to [0,1]$. This is to preserve the above-defined
sense of the score $s$. This monotonicity constraint is discussed further in §6. See
also EJ [21 OH- 

2. Secondly, we assume that we are given a finite number, $T$, of trials, for each of which the
to-be-calibrated pattern classifier has produced a score. We denote these scores
$s_1, s_2, \ldots, s_T$. We need only to map each of these scores to a probability. In other
words, we do not have to find the calibration function itself, we only have to
non-parametrically assign the $T$ function output values $p_1, p_2, \ldots, p_T$, while
respecting the above monotonicity constraint. To simplify notation, we assume without loss of
generality, that $s_1 \le s_2 \le \cdots \le s_T$. (In practice one has to sort the scores to
make it so.) This now means that monotonicity is satisfied if
$0 \le p_1 \le p_2 \le \cdots \le p_T \le 1$. Notice that the input scores now only serve to
define the order. Once this order is fixed, one does not need to refer back to the scores. The
output probabilities can now be independently assigned, as long as they respect the above chain
of inequalities.

4 This exception accommodates cases like the logarithmic scoring rule, which is obtained at
$\rho(\eta) = 1$, see [11], [16].

5 It is easily shown that if $\rho(\eta)$ cannot be normalized (i.e.
$\int_0^1 \rho(\eta)\,d\eta \to \infty$), then one or both of $C_\rho(\theta_1, q)$ or
$C_\rho(\theta_2, q)$ must also be infinite for every value of $q$, so that a useful proper
scoring rule is not obtained.

6 The $\alpha_i > 0$ and sum to 1.

7 In this context the average of the Brier proper scoring rule is just a mean-squared-error.

3. Finally, we assume that the problem is supervised: For every one of the $T$ trials the true
class is known and is denoted: $\ell_1, \ell_2, \ldots, \ell_T \in \{\theta_1, \theta_2\}$. This
allows evaluation of the RBPSR for every trial $t$ as $C_\rho(\ell_t, p_t)$. A weighted
combination of the RBPSR costs for every trial can now be used as the objective function which
needs to be minimized.

In summary the problem which is solved by PAV is that of finding $p_1, p_2, \ldots, p_T$,
subject to the monotonicity constraints, so that the RBPSR objective is minimized. This problem
is succinctly restated in the following section.

2. Main optimization problem statement. The problem of interest may be stated as follows:

1. We are given as input:

(i) A sequence of $T$ indices, denoted $(1,T) = 1, 2, \ldots, T$ with a corresponding sequence
of labels $\ell_1, \ell_2, \ldots, \ell_T \in \{\theta_1, \theta_2\}$.

(ii) A pair of positive weights, $v_1, v_2 > 0$.

2. We use the notation $v(\ell_t)$ to assign a weight to every index, by letting
$v(\theta_1) = v_1$ and $v(\theta_2) = v_2$.

3. The problem is now to find the sequence of $T$ probabilities, denoted
$p_{1,T} = p_1, p_2, \ldots, p_T$, which minimizes the following objective:

$$O_{1,T}(p_{1,T}) = \sum_{t=1}^{T} v(\ell_t)\, C_\rho(\ell_t, p_t), \tag{2}$$

subject to the monotonicity constraint:

$$0 \le p_1 \le p_2 \le \cdots \le p_T \le 1. \tag{3}$$

We require the solution to hold (be a feasible minimum) simultaneously for every RBPSR
$C_\rho$. We already know that if such a solution exists, it must be unique, because the
original PAV algorithm as published in !|j in 1955, was shown to give a unique optimal solution
for the special case of $\rho(\eta) = 1$, for which
$(C_\rho(\theta_1, p), C_\rho(\theta_2, p)) = (-\log(p), -\log(1-p))$. See theorem 1 and
corollary 2 below for details.
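
To make the problem statement concrete, the sketch below evaluates the weighted objective and
checks the monotonicity constraint for a candidate solution. It is an illustration under our own
assumptions: labels are encoded as 1 (target) and 2 (non-target), the quadratic (Brier) rule
stands in for a generic RBPSR, and all function names are hypothetical.

```python
def brier_cost(label, p):
    """Quadratic (Brier) RBPSR, used only as a concrete example of C(label, p)."""
    return 3.0 * (1.0 - p) ** 2 if label == 1 else 3.0 * p ** 2

def sort_labels_by_score(scores, labels):
    """Order the trials by increasing score; afterwards only the labels are needed."""
    order = sorted(range(len(scores)), key=lambda t: scores[t])
    return [labels[t] for t in order]

def objective(labels, probs, v1, v2, cost=brier_cost):
    """Weighted sum of per-trial costs over all trials."""
    return sum((v1 if l == 1 else v2) * cost(l, p) for l, p in zip(labels, probs))

def is_feasible(probs):
    """Monotonicity constraint: 0 <= p_1 <= ... <= p_T <= 1."""
    chain = [0.0] + list(probs) + [1.0]
    return all(a <= b for a, b in zip(chain, chain[1:]))

scores = [0.7, -1.0, 1.5, 0.2]
labels = [2, 2, 1, 1]  # 1 = target (theta_1), 2 = non-target (theta_2)
sorted_labels = sort_labels_by_score(scores, labels)  # [2, 1, 2, 1]
```

With these labels, the assignment `[0, 1, 0, 1]` would give zero cost but is not feasible, so
the monotonicity constraint is what makes the problem non-trivial.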

3. Relationship of our proof to previous work. Although not stated explicitly in terms of a
proper scoring rule, the first publication of the PAV algorithm [1] already proved that it
optimizes the logarithmic proper scoring rule. It is also known that PAV optimizes the quadratic
(Brier) scoring rule [31], and indeed that it optimizes combinations of more general convex
functions [2]. However, as pointed out above, there are proper scoring rules that are not
convex.

In our previous work [7], where we made use of calibration with the PAV algorithm, we did
mention the same results presented here, but without proof. This paper therefore complements
that work, by providing proofs.

We also note that independently, in [15], it was stated "it can be proved that the same [PAV
algorithm] is obtained when using any proper scoring function", but this was also without proof
or further references (footnote 8).



We construct a proof that the PAV algorithm solves the problem as stated in §2 by roughly
following the pattern of the unpublished document [T], where the optimality of PAV was proved
for the case of strictly convex cost functions. That proof is not applicable as is for our
purposes, because as pointed out above, some RBPSR's are not convex. We will show however in
lemma 6 below, that all RBPSR's and their expectations are quasiconvex and that the proof can be
based on this quasiconvexity, rather than on convexity. Note that when working with convex cost
functions, one can use the fact that positively weighted combinations of convex functions are
also convex, but this is not true in general for quasiconvex functions. For our case it was
therefore necessary to prove explicitly that expectations of RBPSR's are also quasiconvex. A
further complication that we needed to address was that non-strict RBPSR's lead to
unidirectional implications, in places where the strictly convex cost functions of the proof in
pQ gave if and only if relationships.

Finally, we note that although the more general case of PAV for non-strict convex cost
functions was treated in [5], we could not base our proof on theirs, because they used
properties of convex functions, such as subgradients, which are not applicable to our
quasiconvex RBPSR's.

8 Notes to reviewers: Note 1: We contacted Fawcett and Niculescu-Mizil to ask if they had a
proof. They replied that their statement was based on the assumption that proper scoring rules
are convex, which by [5] is then optimized by PAV. Since we include here also non-convex proper
scoring rules, our results are more general. Note 2: The paper [28] has the word 'quasi-convex'
in the title and employs the PAV algorithm for a solution. This could suggest that our problem
was solved in that paper, but a different problem was solved there, namely: "the approximation
problem of fitting n data points by a quasi-convex function using the least squares distance
function."

4. Proof of optimality of PAV. This section forms the bulk of this paper and is dedicated to
proving that a version of the PAV algorithm solves the optimization problem stated in §2.

[Figure 1: block diagram of the proof structure. Recoverable node labels: Lemmas 3 & 4;
Lemma 6; Corollary 2; Theorem 7; Lemmas 8 & 9; Theorem 10; Theorem 11: PAV algorithm;
Theorem 12: PAV-LLR algorithm.]

Fig. 1. Proof structure: PAV is optimal for all RBPSR's and PAV-LLR is optimal for all RBPSR's
and priors.

See figure 1 for a roadmap of the proof: Theorem 1 and corollary 2 give the closed-form solution
for the logarithmic RBPSR. For the PAV algorithm, we use corollary 2 just to show that there is
a unique solution, but we re-use it later to prove the prior-independence of the PAV-LLR
algorithm. Inside the dashed box, theorem 5 shows how multiple optimal subproblem solutions can
constitute the optimal solution to the whole problem. Theorems 7 and 10 respectively show how to
find and combine optimal subproblem solutions, so that the PAV algorithm can use them to meet
the requirements of theorem 5.

4.1. Unique solution. In this section, we use the work of Ayer et al., reproduced here as
theorem 1, to show via corollary 2 that, if our problem does have a solution for every RBPSR,
then it must be unique, because the special case of the logarithmic scoring rule (when
$\rho(\eta) = 1$) does have a unique solution.

Theorem 1 (Ayer et al., 1955). Given non-negative real numbers $a_t, b_t$, such that
$a_t + b_t > 0$ for every $t = 1, 2, \ldots, T$, the maximization of the objective
$O'_{1,T}(p_{1,T}) = \prod_{t=1}^{T} p_t^{a_t} (1 - p_t)^{b_t}$, subject to the monotonicity
constraint (3), has the unique solution, $p_{1,T} = p_1, p_2, \ldots, p_T$, where:

$$p_t = \max_{1 \le i \le t}\ \min_{t \le j \le T} r'_{i,j}
      = \min_{t \le j \le T}\ \max_{1 \le i \le t} r'_{i,j}, \tag{4}$$

where

$$r'_{i,j} = \frac{\sum_{k=i}^{j} a_k}{\sum_{k=i}^{j} (a_k + b_k)}. \tag{5}$$

Proof. See [4], theorem 2.2 and its corollary 2.1. In that work, the monotonicity constraint was
non-increasing, rather than the non-decreasing constraint (3) that we use here. The solution
that they give therefore has to be transformed by letting the index $t$ go in reverse order,
which means exchanging the roles of the subsequence endpoints $i, j$, which then has the result
of exchanging the roles of max and min in the solution. □

We now show that this theorem supplies the solution for the special case of the logarithmic
RBPSR:

Corollary 2. If $(C_\rho(\theta_1, p), C_\rho(\theta_2, p)) = (-\log(p), -\log(1-p))$, then the
problem of minimizing objective (2), subject to constraint (3), has the unique solution,
$p_{1,T} = p_1, p_2, \ldots, p_T$, where:

$$p_t = \mathrm{PAV}_t\big((\ell_1, \ell_2, \ldots, \ell_T), (v_1, v_2)\big)
      = \max_{1 \le i \le t}\ \min_{t \le j \le T} r_{i,j}
      = \min_{t \le j \le T}\ \max_{1 \le i \le t} r_{i,j}, \tag{6}$$

where

$$r_{i,j} = \frac{v_1 m_{i,j}}{v_1 m_{i,j} + v_2 n_{i,j}}, \tag{7}$$

where $m_{i,j}$ is the number of $\theta_1$-labels and $n_{i,j}$ the number of
$\theta_2$-labels in subsequence $(i,j)$.

Proof. Observe that if we let

$$(a_t, b_t) = \begin{cases} (v_1, 0), & \text{if } \ell_t = \theta_1, \\
                             (0, v_2), & \text{if } \ell_t = \theta_2, \end{cases}$$

then $r'_{i,j} = r_{i,j}$ and $O'_{1,T}(p_{1,T}) = \exp(-O_{1,T}(p_{1,T}))$, so that the
constrained maximization of theorem 1 and the constrained minimization of this corollary have
the same solution. □

This corollary gives a closed-form solution, $p_t$, to the problem, and from [3] we know that
this is the same solution which is calculated by the iterative PAV algorithm (footnote 10). As
noted above, it has so far [U [U E] only been shown that this solution is valid for logarithmic
and other RBPSR's which have convex expectations. In the following sections we show that this
solution is also optimal for all other RBPSR's.
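
The closed form can be evaluated directly by brute force, which is useful as a reference for
small inputs. The sketch below is our own illustration of the max-min expression over weighted
target proportions; it assumes labels encoded as 1 (target) and 2 (non-target), already sorted
by score, and the function names are hypothetical. It is deliberately naive and far from linear
time.

```python
def r_value(labels, i, j, v1, v2):
    """Weighted target proportion on the 0-based, inclusive span (i, j)."""
    m = sum(1 for label in labels[i:j + 1] if label == 1)
    n = (j - i + 1) - m
    return v1 * m / (v1 * m + v2 * n)

def pav_closed_form(labels, v1=1.0, v2=1.0):
    """For every index t: max over starts i <= t of min over ends j >= t."""
    T = len(labels)
    return [
        max(min(r_value(labels, i, j, v1, v2) for j in range(t, T))
            for i in range(t + 1))
        for t in range(T)
    ]

print(pav_closed_form([2, 1, 2, 1, 1]))  # [0.0, 0.5, 0.5, 1.0, 1.0]
```

In the example, the out-of-order pair in the middle (a target followed by a non-target) is
assigned the common value 0.5, while the trials at either end keep the values 0 and 1.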

4.2. Decomposition into subproblems. We need to consider subsequences of $(1,T)$: For any
$1 \le i \le j \le T$, we denote as $(i,j)$ the subsequence of $(1,T)$ which starts at index $i$
and ends at index $j$. We may compute a partial objective function over a subsequence $(i,j)$
as:

$$O_{i,j}(p_{i,j}) = \sum_{t=i}^{j} v(\ell_t)\, C_\rho(\ell_t, p_t), \tag{8}$$

where $p_{i,j} = p_i, p_{i+1}, \ldots, p_j$. We can now define the subproblem $(i,j)$ as the
problem of minimizing $O_{i,j}(p_{i,j})$, simultaneously for every RBPSR, and subject to the
monotonicity constraint $0 \le p_i \le p_{i+1} \le \cdots \le p_j \le 1$. In what follows, we
shall use the following notational conventions:

1. The subproblem $(1,T)$ is equivalent to the original problem.

2. We shall denote a subproblem solution, $p_{i,j}$, as feasible when the monotonicity
constraint is met and non-feasible otherwise.

3. By subproblem solution we mean just a sequence $p_{i,j}$, feasible or not, such that
$p_i, p_{i+1}, \ldots, p_j \in [0,1]$.

4. Since any subproblem is isomorphic to the original problem, corollary 2 also shows that if
(footnote 11) it has a feasible minimizing solution for every RBPSR, then that solution must be
unique. Hence, by the optimal subproblem solution, we mean the unique feasible solution that
minimizes $O_{i,j}(\cdot)$, for every RBPSR.

5. By a partitioning of the problem $(1,T)$ into a set, $S$, of adjacent, non-overlapping
subproblems, we mean that every index occurs exactly once in all of the subproblems, so that:

$$O_{1,T}(p_{1,T}) = \sum_{(i,j) \in S} O_{i,j}(p_{i,j}). \tag{9}$$

Our first important step is to show with theorem 5, proved via lemmas 3 and 4, how the optimal
total solution may be constituted from optimal subproblem solutions:
Lemma 3. For a given RBPSR and for a given partitioning, $S$, of $(1,T)$ into subproblems, let:

(i) $p^*_{1,T} = p^*_1, p^*_2, \ldots, p^*_T$ be a feasible solution to the whole problem, with
minimum total objective $O_{1,T}(p^*_{1,T})$; and

(ii) for every subproblem $(i,j) \in S$, let $q^*_{i,j} = q^*_i, q^*_{i+1}, \ldots, q^*_j$
denote a feasible subproblem solution with minimum partial objective $O_{i,j}(q^*_{i,j})$; and

(iii) $q^*_{1,T} = q^*_1, q^*_2, \ldots, q^*_T$ denote the concatenation of all the subproblem
solutions $q^*_{i,j}$, in order, to form a (not necessarily feasible) solution to the whole
problem $(1,T)$,

then

$$O_{1,T}(q^*_{1,T}) = \sum_{(i,j) \in S} O_{i,j}(q^*_{i,j})
  \le \sum_{(i,j) \in S} O_{i,j}(p^*_{i,j}) = O_{1,T}(p^*_{1,T}). \tag{10}$$

Proof. Follows by recalling (9) and by noting that for every $(i,j)$,
$O_{i,j}(q^*_{i,j}) \le O_{i,j}(p^*_{i,j})$, because (except at $i = 1$ and $j = T$)
minimization of the RHS is subject to the extra constraints $p^*_{i-1} \le p^*_i$ and
$p^*_j \le p^*_{j+1}$. □

10 The PAV algorithm, if efficiently implemented, is known [25], [2], [30] to have linear
computational load (of order $T$), which is superior to a straightforward implementation of the
explicit form (6).

11 The object of this whole exercise is to prove that the optimal solution exists for every
subproblem and is given by the PAV algorithm, but until we have proved this, we cannot assume
that the optimal solution exists for every subproblem.

Lemma 4. For a given RBPSR and for a given partitioning, $S$, of $(1,T)$ into subproblems, let
$p^*_{1,T} = p^*_1, p^*_2, \ldots, p^*_T$ be a feasible solution to the whole problem, with
minimum total objective $O_{1,T}(p^*_{1,T})$, and let $q_{1,T} = q_1, q_2, \ldots, q_T$ be any
feasible solution to the whole problem, with total objective $O_{1,T}(q_{1,T})$. Then

$$O_{1,T}(q_{1,T}) = \sum_{(i,j) \in S} O_{i,j}(q_{i,j}) \ge O_{1,T}(p^*_{1,T}). \tag{11}$$

Proof. Follows directly from (9) and the premise. □

Theorem 5. Let $q^*_{1,T} = q^*_1, q^*_2, \ldots, q^*_T$ be a feasible solution for $(1,T)$ and
let $S$ be a partitioning of $(1,T)$ into subproblems, such that for every $(i,j) \in S$, the
subsequence $q^*_{i,j} = q^*_i, q^*_{i+1}, \ldots, q^*_j$ is the optimal solution to subproblem
$(i,j)$, then $q^*_{1,T}$ is the optimal solution to the whole problem $(1,T)$.

Proof. The premises make lemmas 3 and 4 applicable, for every RBPSR. Since both inequalities
(10) and (11) are satisfied, $O_{1,T}(q^*_{1,T}) = O_{1,T}(p^*_{1,T})$, where $p^*_{1,T}$ is an
optimal solution for each RBPSR. Hence $q^*_{1,T}$ is optimal for every RBPSR and is by
corollary 2 the unique optimal solution. □

4.3. Constant subproblem solutions. In what follows constant subproblem solutions will be of
central importance. A solution $p_{i,j}$ is constant if $p_i = p_{i+1} = \cdots = p_j = q$, for
some $0 \le q \le 1$. In this case, we use the short-hand notation
$O_{i,j}(q) = O_{i,j}(p_{i,j})$ to denote the subproblem objective, and this may be expressed as:

$$O_{i,j}(q) = O_{i,j}(p_{i,j}) = \sum_{t=i}^{j} v(\ell_t)\, C_\rho(\ell_t, q)
             = m v_1 C_\rho(\theta_1, q) + n v_2 C_\rho(\theta_2, q), \tag{12}$$

where $m$ is the number of $\theta_1$-labels and $n$ the number of $\theta_2$-labels. Note:

1. A constant subproblem solution is always feasible.

2. If it exists, the optimal solution to an arbitrary subproblem may or may not be constant.

Whether optimal or not, it is important to examine the behaviour of subproblem solutions that
are constrained to be constant. This behaviour is governed by the quasiconvex (footnote 12)
properties of $O_{i,j}(q)$ as summarized in the following lemma:

Lemma 6. Let $r_{i,j} = \frac{v_1 m}{v_1 m + v_2 n}$, where $m$ is the number of
$\theta_1$-labels and $n$ the number of $\theta_2$-labels in the subsequence $(i,j)$, and let
$O_{i,j}(q) = m v_1 C_\rho(\theta_1, q) + n v_2 C_\rho(\theta_2, q)$ be the objective for the
constant subproblem solution, $p_i = p_{i+1} = \cdots = p_j = q$, then the following properties
hold, where $C_\rho$ is any RBPSR, and where we also note the specialization for strict RBPSR's:

1. If $q \le q' \le r_{i,j}$, then $O_{i,j}(q) \ge O_{i,j}(q') \ge O_{i,j}(r_{i,j})$.
   strict case: If $q < q' \le r_{i,j}$, then $O_{i,j}(q) > O_{i,j}(q')$.

2. If $q' \ge q \ge r_{i,j}$, then $O_{i,j}(q') \ge O_{i,j}(q) \ge O_{i,j}(r_{i,j})$.
   strict case: If $q' > q \ge r_{i,j}$, then $O_{i,j}(q') > O_{i,j}(q)$.

3. $\min_q O_{i,j}(q) = O_{i,j}(r_{i,j})$,
   strict case: $q = r_{i,j}$ is the unique minimum.
   (This is the salient property of binary proper scoring rules, which was mentioned above.)

12 A real-valued function $f(p)$, defined on a real interval is quasiconvex, if every sublevel
set of the form $\{p \mid f(p) < a\}$ is convex (i.e. a real interval) [3]. Lemma 6 shows that
$O_{i,j}(q)$ is quasiconvex.

Proof. For convenience in this proof, we drop the subscripts $i,j$, letting
$r = r_{i,j} = \frac{m v_1}{m v_1 + n v_2}$. The expected value of $C_\rho(\theta, q)$ w.r.t.
probability $r$ is:

$$e(q) = E_{\theta|r}\{C_\rho(\theta, q)\} = \frac{1}{m v_1 + n v_2}\, O_{i,j}(q)
       = r\, C_\rho(\theta_1, q) + (1 - r)\, C_\rho(\theta_2, q). \tag{13}$$

Clearly, if the above properties hold for $e(q)$, then they will also hold for $O_{i,j}(q)$. We
prove these properties for $e(q)$ by letting $q \le q'$ and by examining the sign of
$\Delta e = e(q') - e(q)$: If $q' = q$, then $\Delta e = 0$. If $q < q'$, then (1) gives:

$$\Delta e = \int_q^{q'} (\eta - r)\, \frac{\rho(\eta)}{\eta(1-\eta)}\, d\eta. \tag{14}$$

The non-strict versions of properties 1, 2 and 3 now follow from the following observation:
Since $\rho(\eta) \ge 0$ for $0 \le \eta \le 1$, the sign of the integrand and therefore of
$\Delta e$ depends solely on the sign of $(\eta - r)$, giving:
(i) $\Delta e \ge 0$, if $r \le q < q'$.
(ii) $\Delta e \le 0$, if $q < q' \le r$.
If more specifically, $\rho(\eta) > 0$ almost everywhere, then in both of these cases the
integrand is non-zero almost everywhere on $(q, q')$, so that $|\Delta e| > 0$. In this case,
the RBPSR is denoted strict and we have:
(i) $\Delta e > 0$, if $r \le q < q'$.
(ii) $\Delta e < 0$, if $q < q' \le r$.
which concludes the proof also for the strict cases. □
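
The lemma can be illustrated numerically with a non-convex rule. The sketch below is our own
example, not part of the proof: it uses an equal mixture of two Dirac-delta thresholds (at 0.2
and 0.6, chosen arbitrarily), for which the constant-solution objective is a step function of
$q$. The function names and the tie convention at the thresholds are assumptions.

```python
def two_threshold_rule(q, etas=(0.2, 0.6), alphas=(0.5, 0.5)):
    """RBPSR for rho = sum_i alpha_i * delta(eta - eta_i); a step function of q.

    The value exactly at q == eta_i is a convention (assumed here).
    """
    c1 = sum(a / eta for a, eta in zip(alphas, etas) if q < eta)
    c2 = sum(a / (1.0 - eta) for a, eta in zip(alphas, etas) if q >= eta)
    return c1, c2

def constant_objective(q, m, n, v1=1.0, v2=1.0, rule=two_threshold_rule):
    """Objective of the constant solution q on a span with m targets, n non-targets."""
    c1, c2 = rule(q)
    return m * v1 * c1 + n * v2 * c2

m, n = 1, 1  # with unit weights this gives r = 0.5
grid = [i / 100.0 for i in range(101)]
values = [constant_objective(q, m, n) for q in grid]
r_index = 50  # grid[50] == 0.5 == r
left, right = values[:r_index + 1], values[r_index:]
non_increasing_left = all(a >= b for a, b in zip(left, left[1:]))
non_decreasing_right = all(a <= b for a, b in zip(right, right[1:]))
```

Both flags come out true and the value at `r_index` is the grid minimum, even though the
objective jumps upward at 0.6 and is therefore not convex. This is the quasiconvex behaviour
that the rest of the proof relies on.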

For now, we need only property 3 to proceed. We use the other properties later. The optimal
constant subproblem solution is characterized in the following theorem:

Theorem 7. If the optimal solution to subproblem $(i,j)$ is constant, then:

1. The constant is $r_{i,j}$.

2. For any index $k$, such that $i \le k \le j$, the following are both true:
(i) $r_{i,k} \ge r_{i,j}$
(ii) $r_{k,j} \le r_{i,j}$
where $r_{i,k}$ and $r_{k,j}$ are defined in a similar way to $r_{i,j}$, but for the subproblems
$(i,k)$ and $(k,j)$.

Proof. Property 1 of this theorem follows directly from property 3 of lemma 6. To prove
property 2, we use contradiction: If the negation of 2(i) were true, namely
$r_{i,k} < r_{i,j}$, then the non-constant solution
$p_i = \cdots = p_k = r_{i,k} < p_{k+1} = \cdots = p_j = r_{i,j}$ would be feasible and (by
lemma 6) would have lower objective, namely $O_{i,k}(r_{i,k}) + O_{k+1,j}(r_{i,j})$, for any
strict RBPSR, than that of the constant solution, namely
$O_{i,k}(r_{i,j}) + O_{k+1,j}(r_{i,j})$. This contradicts the premise that the optimal solution
is constant, so that 2(i) must be true. Property 2(ii) is proved by a similar contradiction. □

4.4. Pooling adjacent constant solutions. This section shows (using lemmas 8 and 9 to prove
theorem 10) when and how optimal constant subproblem solutions may be assembled by pooling
smaller adjacent constant solutions:

Lemma 8. Given a subproblem $(i,j)$, for which the optimal solution is constant (at $r_{i,j}$),
we can form the augmented subproblem, with the additional constraint that the solution at $j$
must satisfy $p_j \le \alpha$, for some $\alpha$ such that $0 \le \alpha \le r_{i,j}$. That is,
the solution to the augmented subproblem must satisfy
$0 \le p_i \le p_{i+1} \le \cdots \le p_j \le \alpha \le r_{i,j}$. Then the augmented subproblem
solution is optimized, for every RBPSR, by the constant solution
$p_i = p_{i+1} = \cdots = p_j = \alpha$.

Proof. Feasible solutions to the augmented subproblem must satisfy either (i)
$p_i = \cdots = p_j = \alpha$, or (ii) $p_i < \alpha$. We need to show that there is no feasible
solution of type (ii), which has a lower objective value, for any RBPSR, than solution (i).

For a given solution, let $k$ be an index such that $i \le k \le j$ and
$p_i = p_{i+1} = \cdots = p_k$. By combining the premises of this lemma with property 2(i) of
theorem 7, we find: $p_i = \cdots = p_k \le \alpha \le r_{i,j} \le r_{i,k}$, or more succinctly:
$p_i = \cdots = p_k \le \alpha \le r_{i,k}$. Now the monotonicity property 1 of lemma 6 shows
that the value of $p_i = \cdots = p_k$, which is optimal for all RBPSR's, must be as large as
allowed by the constraints. This means if we start at $k = i$, then $p_i$ is optimized at the
constraint $p_i = p_{i+1}$. Next we set $k = i+1$ to see that $p_i = p_{i+1}$ is optimized at
the next constraint $p_i = p_{i+1} = p_{i+2}$. We keep incrementing $k$, until we find the
optimum for the augmented subproblem at the constant solution $p_i = \cdots = p_j = \alpha$. □

Lemma 9. Given a subproblem $(i,j)$, for which the optimal solution is constant (at $r_{i,j}$),
we can form the augmented subproblem, with the additional constraint that the solution at $i$
must satisfy $\alpha \le p_i$, for some $\alpha$ such that $r_{i,j} \le \alpha \le 1$. That is,
the solution to the augmented subproblem must satisfy
$r_{i,j} \le \alpha \le p_i \le p_{i+1} \le \cdots \le p_j \le 1$. Then the augmented subproblem
solution is optimized, for every RBPSR, by the constant solution
$p_i = p_{i+1} = \cdots = p_j = \alpha$.

Proof. The proof is similar to that of lemma 8, but here we invoke property 2(ii) of theorem 7
to find: $r_{k,j} \le \alpha \le p_k = \cdots = p_j$ and we use the monotonicity property 2 of
lemma 6 to show that the value of $p_k = \cdots = p_j$, which is optimal for all RBPSR's, must
be as small as allowed by the constraints. □

Theorem 10. Given indices $i \le k < j$ such that the optimal subproblem solutions for the two
adjacent subproblems, $(i,k)$ and $(k+1,j)$, are both constant and therefore (by theorem 7) have
the respective values $r_{i,k}$ and $r_{k+1,j}$, then, whenever $r_{i,k} \ge r_{k+1,j}$, the
optimal solution for the pooled subproblem $(i,j)$ is also constant, and has the value
$r_{i,j}$.

Proof. First consider the case $r_{i,k} = r_{k+1,j}$. Since this forms a constant solution to
subproblem $(i,j)$, by theorem 7 the optimal solution is $r_{i,j}$.

Next consider $r_{i,k} > r_{k+1,j}$. The solution
$p_i = \cdots = p_k = r_{i,k} > p_{k+1} = \cdots = p_j = r_{k+1,j}$ is not feasible. A feasible
solution must obey $p_k \le \alpha \le p_{k+1}$, for some $\alpha$. There are three
possibilities for the value of $\alpha$: (i) $\alpha \le r_{k+1,j}$; (ii)
$r_{k+1,j} < \alpha < r_{i,k}$; or (iii) $r_{i,k} \le \alpha$. We examine each in turn:

(i) If $\alpha \le r_{k+1,j} < r_{i,k}$, then the left subproblem $(i,k)$ is augmented by the
constraint $\alpha < r_{i,k}$, so that lemma 8 applies and it is optimized at the constant
solution $\alpha$, while the right subproblem $(k+1,j)$ is not further constrained and is still
optimized at $r_{k+1,j}$. We can now optimize the total solution for $(i,j)$ by adjusting
$\alpha$: By the monotonicity property 1 of lemma 6, the left subproblem objective and therefore
also the total objective for $(i,j)$ is optimized at the upper boundary $\alpha = r_{k+1,j}$.
In other words, in this case, the optimum for subproblem $(i,j)$ is a constant solution.

(ii) If $r_{k+1,j} < \alpha < r_{i,k}$, then lemma 8 applies to the left subproblem and lemma 9
applies to the right subproblem, so that both subproblems and therefore also the total
objective for $(i,j)$ are all optimized at $\alpha$. In this case also we have a constant
solution for $(i,j)$.

(iii) If $r_{k+1,j} < r_{i,k} \le \alpha$, then the right subproblem is augmented while the left
subproblem is not further constrained. We can now use lemma 9 and property 2 of lemma 6 in a
similar way to case (i) to show that in this case also, the optimum solution is constant.

Since the three cases exhaust the possibilities for choosing $\alpha$, the optimal solution is
indeed constant and by theorem 7 the optimum is at $r_{i,j}$. □
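
In computational terms, pooling two adjacent constant solutions only requires adding their label
counts. The sketch below is our own illustration with hypothetical names; it also shows that the
pooled value is a weighted average of the two original values and therefore lies between them.

```python
def span_value(m, n, v1=1.0, v2=1.0):
    """Optimal constant value for a span holding m target and n non-target labels."""
    return v1 * m / (v1 * m + v2 * n)

def pool(left, right):
    """Pool two adjacent spans, each given as an (m, n) pair of label counts."""
    return left[0] + right[0], left[1] + right[1]

left, right = (2, 1), (1, 2)  # left value 2/3 exceeds right value 1/3: a violation
pooled = pool(left, right)    # (3, 3)
r_left = span_value(*left)
r_right = span_value(*right)
r_pooled = span_value(*pooled)
```

Here `r_pooled` is 0.5, between `r_right` and `r_left`, so replacing the two violating spans by
one pooled span restores monotonicity locally.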

4.5. The PAV algorithm. We can now use theorems 5, 7 and 10 to construct
a proof that a version of the pool-adjacent-violators (PAV) algorithm solves the whole
problem $(1,T)$.

Theorem 11. The PAV algorithm solves the problem stated in §2.
Proof. The proof is constructive. The strategy is to satisfy the conditions for
theorem 5 by starting with optimal constant subproblem solutions of length 1 and then
to iteratively combine them via theorem 10 into longer optimal constant solutions
until the total solution is feasible. The algorithm proceeds as follows:
input:

(i) labels, $\ell_1, \ell_2, \ldots, \ell_T \in \{\theta_1, \theta_2\}$.

(ii) weights, $v_1, v_2 > 0$.

variables:

(i) $S$, a partitioning of problem $(1,T)$ into adjacent, non-overlapping subproblems.

(ii) $q^*_{1,T} = q^*_1, q^*_2, \ldots, q^*_T$, a tentative (not necessarily feasible) solution for
problem $(1,T)$.

loop invariant: For every subproblem $(i,j) \in S$:

(i) The optimal subproblem solution is constant.

(ii) The partial solution $q^*_{i,j} = q^*_i, q^*_{i+1}, \ldots, q^*_j$ is equal to the optimal
subproblem solution, i.e. constant, with value $r_{i,j}$ (by theorem 7).

initialization: Let $S$ be the finest partitioning into subproblems, so that there
are $T$ subproblems, each spanning a single index. Clearly every subproblem $(i,i)$ has
a constant solution, optimized at $q^*_i = r_{i,i}$, which is 1 if $\ell_i = \theta_1$, or 0 if $\ell_i = \theta_2$. This
initial solution $q^*_{1,T}$ respects the loop invariant, but is most probably not feasible.

iteration: While $q^*_{1,T}$ is not feasible:

1. Find any pair of adjacent subproblems, $(i,k), (k+1,j) \in S$, for which the
solutions are equal or violate monotonicity: $r_{i,k} \ge r_{k+1,j}$.

2. Pool $(i,k)$ and $(k+1,j)$ into one subproblem $(i,j)$, by adjusting $S$ and by
assigning the constant solution $r_{i,j}$ to $q^*_{i,j}$, which by theorem 10 is optimal for $(i,j)$,
thus maintaining the loop invariant.

termination: Clearly the iteration must terminate after at most $T-1$ pooling
steps, at which time $q^*_{1,T}$ is feasible and is still optimal for every subproblem.
By theorem 5, $q^*_{1,T}$ is then the unique optimal solution to problem $(1,T)$. □
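
The initialization, pooling and termination steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than a reference implementation: labels are encoded as 1 for $\theta_1$ and 0 for $\theta_2$, the function names `pav` and `block_value` are hypothetical, and the block value is taken to be the weighted proportion $v_1 m / (v_1 m + v_2 n)$ of $\theta_1$ labels in a block. The left-to-right stack is one particular way of realizing "find any pair" of violating subproblems.

```python
def block_value(block):
    """Constant optimum of one block: v1*m / (v1*m + v2*n)."""
    mass_1, mass_2, _ = block
    return mass_1 / (mass_1 + mass_2)


def pav(labels, v1=1.0, v2=1.0):
    """Pool-adjacent-violators for binary labels (1 = theta_1, 0 = theta_2).

    Returns the non-decreasing sequence q*_1, ..., q*_T.
    """
    # Each block stores [weighted theta_1 mass, weighted theta_2 mass, length].
    blocks = []
    for label in labels:
        # Initialization: a new block spanning a single index.
        blocks.append([v1, 0.0, 1] if label == 1 else [0.0, v2, 1])
        # Iteration: pool while the last two blocks are equal or violate
        # monotonicity, i.e. r_{i,k} >= r_{k+1,j}.
        while len(blocks) > 1 and block_value(blocks[-2]) >= block_value(blocks[-1]):
            mass_1, mass_2, length = blocks.pop()
            blocks[-1][0] += mass_1
            blocks[-1][1] += mass_2
            blocks[-1][2] += length
    # Termination: adjacent block values are now strictly increasing.
    solution = []
    for block in blocks:
        solution.extend([block_value(block)] * block[2])
    return solution


if __name__ == "__main__":
    print(pav([0, 1, 0, 1, 1, 0, 1]))
```

For the hypothetical label sequence in the usage line, the un-weighted solution consists of the blocks 0, (1/2, 1/2), (2/3, 2/3, 2/3), 1. Each appended label causes at most as many pooling steps as there are blocks to its left, so the total number of pooling steps is at most $T-1$, as in the termination argument.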

5. The PAV-LLR algorithm. The PAV algorithm as presented above finds
solutions in the form of probabilities. Here we show how to use it to find solutions
in terms of log-likelihood-ratios. It will be convenient here to express Bayes' rule
in terms of the logit function, $\operatorname{logit}(p) = \log\frac{p}{1-p}$. Note logit is a monotonic rising
bijection between $[0,1]$ and the extended real line. Its inverse is the sigmoid function,
$\sigma(w) = \frac{1}{1+e^{-w}}$. Bayes' rule is now [T§] :

$$\operatorname{logit} P(\theta_1 \mid s_t) = w_t + \pi \qquad (15)$$

where the LHS is the posterior log-odds, $w_t = \log\frac{P(s_t \mid \theta_1)}{P(s_t \mid \theta_2)}$ is the log-likelihood-ratio
and $\pi = \operatorname{logit} P(\theta_1)$ is the prior log-odds.
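
As a quick numerical illustration of the log-odds form of Bayes' rule, the sketch below compares it with the usual product form. The likelihood and prior values are hypothetical numbers chosen only for this example, and the helper names are not taken from the paper.

```python
import math


def logit(p):
    """Log-odds of a probability 0 < p < 1."""
    return math.log(p / (1.0 - p))


def sigmoid(w):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-w))


# Hypothetical numbers, chosen only for illustration.
likelihood_1 = 0.3   # P(s_t | theta_1)
likelihood_2 = 0.1   # P(s_t | theta_2)
prior_1 = 0.2        # P(theta_1)

# Posterior from Bayes' rule in its usual product form.
posterior_direct = (likelihood_1 * prior_1) / (
    likelihood_1 * prior_1 + likelihood_2 * (1.0 - prior_1)
)

# Posterior from the log-odds form (15): logit P(theta_1 | s_t) = w_t + pi.
llr = math.log(likelihood_1 / likelihood_2)   # log-likelihood-ratio w_t
prior_log_odds = logit(prior_1)               # prior log-odds pi
posterior_logit_form = sigmoid(llr + prior_log_odds)

if __name__ == "__main__":
    print(posterior_direct, posterior_logit_form)
```

Both routes give the same posterior, 3/7, for these numbers. The point of the log-odds form is that the contribution of the data ($w_t$) and the contribution of the prior ($\pi$) are separated additively.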

The problem that is solved by the PAV-LLR algorithm can now be described as 
follows: 

1. There is given: 

(i) Labels, $\ell_1, \ell_2, \ldots, \ell_T \in \{\theta_1, \theta_2\}$. We denote as $T_1$ and $T_2$ the respective
numbers of $\theta_1$ and $\theta_2$ labels in this sequence, so that $T_1 + T_2 = T$.

(ii) Prior log-odds $\pi$, where $-\infty < \pi < \infty$. This determines a prior probability
distribution for the two classes, namely $(P(\theta_1), P(\theta_2)) = (\sigma(\pi), 1 - \sigma(\pi))$, which may
be different from the label proportions $(\frac{T_1}{T}, \frac{T_2}{T})$.

(iii) An RBPSR, $C_\rho$.

2. There is required a solution $w_{1,T} = w_1, w_2, \ldots, w_T$, which minimizes the
following objective:

$$O_{1,T}(w_{1,T}) = \sum_{t=1}^{T} v(\ell_t)\, C_\rho(\ell_t, p_t), \qquad (16)$$

$$p_t = \sigma(w_t + \pi), \qquad (17)$$

$$v_1 = v(\theta_1) = \frac{P(\theta_1)}{T_1} = \frac{\sigma(\pi)}{T_1}, \qquad (18)$$

$$v_2 = v(\theta_2) = \frac{P(\theta_2)}{T_2} = \frac{1 - \sigma(\pi)}{T_2}. \qquad (19)$$

(The weights $v_1, v_2$ are chosen thus (footnote 13) to cancel the influence of the proportions of label
types, and to re-weight the optimization objective with the given prior probabilities for
the two classes, but we show below that this re-weighting is irrelevant when optimizing
with PAV.)

3. The minimization is subject to the monotonicity constraint:

$$-\infty \le w_1 \le w_2 \le \cdots \le w_T \le \infty, \qquad (20)$$

which by the monotonicity of (15) and the logit transformation is equivalent to (3).
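
To make the objective of step 2 concrete, the following sketch evaluates it for a given sequence of log-likelihood-ratios. It is written under explicit assumptions: the class weights are taken to be $v_1 = P(\theta_1)/T_1$ and $v_2 = P(\theta_2)/T_2$, which is the reading of (18) and (19) suggested by the remark that the weights cancel the label proportions and re-weight with the prior; the logarithmic rule stands in as one example of an RBPSR; labels are encoded as 1 for $\theta_1$ and 0 for $\theta_2$; and all function names are hypothetical.

```python
import math


def sigmoid(w):
    return 1.0 / (1.0 + math.exp(-w))


def class_weights(prior_log_odds, t1, t2):
    """Assumed weights v1 = P(theta_1)/T1 and v2 = P(theta_2)/T2."""
    p1 = sigmoid(prior_log_odds)
    return p1 / t1, (1.0 - p1) / t2


def log_rule(label, p):
    """Logarithmic scoring rule, used here as one example of an RBPSR."""
    return -math.log(p) if label == 1 else -math.log(1.0 - p)


def objective(llrs, labels, prior_log_odds, cost=log_rule):
    """Weighted objective: sum over t of v(label_t) * cost(label_t, p_t)."""
    t1 = sum(labels)
    t2 = len(labels) - t1
    v1, v2 = class_weights(prior_log_odds, t1, t2)
    total = 0.0
    for w, label in zip(llrs, labels):
        p = sigmoid(w + prior_log_odds)      # posterior, as in (17)
        total += (v1 if label == 1 else v2) * cost(label, p)
    return total


if __name__ == "__main__":
    # Uninformative LLRs (all zero) with a flat prior (pi = 0).
    print(objective([0.0, 0.0, 0.0, 0.0], [0, 1, 0, 1], 0.0))
```

With these weights the total weight given to each class equals its prior probability regardless of how many labels of each type are present. In the usage line, where every log-likelihood-ratio is zero and the prior is flat, the logarithmic objective reduces to the entropy of the prior, $\log 2$.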
This problem is solved by first finding the probabilities $p_1, p_2, \ldots, p_T$ via the PAV
algorithm and then inverting (17) to find $w_t = \operatorname{logit}(p_t) - \pi$. We already know that
the solution is independent of the RBPSR, but remarkably, it is also independent of
the prior $\pi$. This is shown in the following theorem:

Theorem 12. Let $p_t = \mathrm{PAV}_t((\ell_1, \ell_2, \ldots, \ell_T), (v_1, v_2))$ be given by (6), then the
problem of minimizing objective (16), subject to monotonicity constraint (20) has the
unique solution:

$$w_t = \operatorname{logit} \mathrm{PAV}_t((\ell_1, \ell_2, \ldots, \ell_T), (1, 1)) - \operatorname{logit}\frac{T_1}{T}, \qquad (21)$$

This solution is simultaneously optimal for every RBPSR, $C_\rho$, and any prior log-odds,
$-\infty < \pi < \infty$.

13 This kind of class-conditional weighting has been used in several formal evaluations of the
technologies of automatic speaker recognition and automatic language recognition, to weight the
error-rates of hard recognition decisions [20, 22] and more recently to also weight logarithmic proper
scoring of recognition outputs in log-likelihood-ratio form ,7 27 23 .

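Proof. By the properties of the PAV as proved in §4.5 and since logit is a strictly
monotonic rising bijection, it is clear that for all RBPSR's and for a given $\pi$, this
minimization is solved as

$$w_t = \operatorname{logit} \mathrm{PAV}_t((\ell_1, \ell_2, \ldots, \ell_T), (v_1, v_2)) - \pi \qquad (22)$$

where $\pi$ determines $v_1$ and $v_2$ via (18) and (19). By the corollary that gives the PAV
solution in closed form, we can write component $t$ of this solution as:

$$w_t = \operatorname{logit}\Bigl(\max_{1 \le i \le t}\, \min_{t \le j \le T} r_{i,j}\Bigr) - \pi = \max_{1 \le i \le t}\, \min_{t \le j \le T} \operatorname{logit} r_{i,j} - \pi. \qquad (23)$$

Now observe that:

$$\operatorname{logit} r_{i,j} = \operatorname{logit}\frac{v_1 m_{i,j}}{v_1 m_{i,j} + v_2 n_{i,j}} = \operatorname{logit}\frac{m_{i,j}}{m_{i,j} + n_{i,j}} - \operatorname{logit}\frac{T_1}{T} + \pi, \qquad (24)$$

which shows that $w_t$ is independent of $\pi$. Now the prior may be conveniently chosen
to equal the label proportion, $\pi = \operatorname{logit}\frac{T_1}{T}$, to give an un-weighted PAV, with
$v_1 = v_2$ (equivalently $v_1 = v_2 = 1$, since $r_{i,j}$ depends only on the ratio $v_1 / v_2$). □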
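
The independence from the prior stated in theorem 12 can be observed numerically. The sketch below is an illustration under assumptions, not the authors' code: labels are encoded as 1 for $\theta_1$ and 0 for $\theta_2$, the weights are assumed to be $v_1 = \sigma(\pi)/T_1$ and $v_2 = (1 - \sigma(\pi))/T_2$, the label sequence is hypothetical, and all function names are invented for this example. It computes the log-likelihood-ratios once via the weighted route (22) for several priors and once via the un-weighted route (21).

```python
import math


def _value(block):
    return block[0] / (block[0] + block[1])


def pav(labels, v1=1.0, v2=1.0):
    """Weighted PAV for binary labels (1 = theta_1, 0 = theta_2)."""
    blocks = []  # [weighted theta_1 mass, weighted theta_2 mass, length]
    for label in labels:
        blocks.append([v1, 0.0, 1] if label == 1 else [0.0, v2, 1])
        while len(blocks) > 1 and _value(blocks[-2]) >= _value(blocks[-1]):
            m1, m2, n = blocks.pop()
            blocks[-1][0] += m1
            blocks[-1][1] += m2
            blocks[-1][2] += n
    return [_value(b) for b in blocks for _ in range(b[2])]


def logit(p):
    """Logit on [0, 1], mapping the endpoints to -inf and +inf."""
    if p <= 0.0:
        return -math.inf
    if p >= 1.0:
        return math.inf
    return math.log(p / (1.0 - p))


def sigmoid(w):
    return 1.0 / (1.0 + math.exp(-w))


def pav_llr(labels, prior_log_odds):
    """Route (22): weighted PAV, then subtract the prior log-odds."""
    t1 = sum(labels)
    t2 = len(labels) - t1
    v1 = sigmoid(prior_log_odds) / t1
    v2 = (1.0 - sigmoid(prior_log_odds)) / t2
    return [logit(p) - prior_log_odds for p in pav(labels, v1, v2)]


def pav_llr_unweighted(labels):
    """Route (21): un-weighted PAV, then subtract logit(T1 / T)."""
    offset = logit(sum(labels) / len(labels))
    return [logit(p) - offset for p in pav(labels)]


def same(a, b):
    """Equality that also accepts matching infinities."""
    return a == b or math.isclose(a, b, abs_tol=1e-9)


if __name__ == "__main__":
    labels = [0, 1, 0, 1, 1, 0, 1]
    reference = pav_llr_unweighted(labels)
    for prior_log_odds in (-2.0, 0.0, 3.0):
        llrs = pav_llr(labels, prior_log_odds)
        print(prior_log_odds, all(same(a, b) for a, b in zip(llrs, reference)))
```

Blocks that contain only one label type receive infinite log-likelihood-ratios, which is why the comparison helper treats matching infinities as equal. The finite values agree across priors up to floating-point rounding.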

6. Discussion. We have shown that the problem of monotonic, non-parametric 
calibration of binary pattern recognition scores is optimally solved by PAV, for all 
regular binary proper scoring rules. This is true for calibration in posterior probability 
form and also in log-likelihood-ratio form.

We conclude by addressing some concerns that readers may have about whether 
the optimization problem solved here is actually useful in real pattern recognition 
practice, where a calibration transform is trained in a supervised way (as here) on 
some training data, but is then utilized later on new unsupervised data. 

The first concern we address is about the non-parametric nature of the PAV 
mapping, because for general real scores there will be new unmapped score values. 
An obvious solution is to map new values by interpolating between the (input, output) 
pairs in the PAV solution and this was indeed done in several of the references cited
in this paper (see e.g. [30] for an interpolation algorithm).
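
One simple way to map unseen scores is plain linear interpolation between neighbouring (input, output) pairs, holding the output constant beyond the range of the training scores. The sketch below shows only this generic scheme and is not necessarily the interpolation algorithm of the cited reference. The function name, the clamping at the ends, and the example scores are assumptions made for illustration.

```python
import bisect


def interpolate_calibration(train_scores, train_outputs, new_score):
    """Linearly interpolate a calibrated output for an unseen score.

    train_scores must be sorted in ascending order, and train_outputs must be
    the matching non-decreasing PAV outputs.
    """
    # Outside the training range: hold the nearest end value (an assumption).
    if new_score <= train_scores[0]:
        return train_outputs[0]
    if new_score >= train_scores[-1]:
        return train_outputs[-1]
    # Index of the first training score strictly greater than new_score.
    hi = bisect.bisect_right(train_scores, new_score)
    lo = hi - 1
    x0, x1 = train_scores[lo], train_scores[hi]
    y0, y1 = train_outputs[lo], train_outputs[hi]
    return y0 + (y1 - y0) * (new_score - x0) / (x1 - x0)


if __name__ == "__main__":
    # Hypothetical sorted training scores with their PAV outputs.
    scores = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0]
    outputs = [0.0, 0.5, 0.5, 2 / 3, 2 / 3, 2 / 3, 1.0]
    print(interpolate_calibration(scores, outputs, 0.5))
```

Because the PAV outputs are non-decreasing, the interpolated mapping is also non-decreasing, so the monotonicity constraint is preserved on new data.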

Another concern is that the PAV mapping from scores to calibrated outputs
has flat regions (all those constant subproblem solutions) and is therefore not an
invertible transformation. Invertible transformations are information-preserving, but
non-invertible transformations may lose some of the relevant information contained
in the input score. This concern is answered by noting that expectations of proper
scoring rules are generalized information measures [12, 11] and that in particular the
expectation of the logarithmic scoring rule is equivalent to Shannon's cross-entropy
information measure [10]. So by optimizing proper scoring rules, we are indeed
optimizing the information relevant to discriminating between the two classes. Also note
that a strictly monotonic (i.e. invertible) transformation can be formed by adding
an arbitrarily small strictly monotonic perturbation to the PAV solution. The PAV
solution can be viewed as the argument of the infimum of the RBPSR objective, over
all strictly rising monotonic transformations.
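
The infimum statement can be illustrated with a small experiment. In the sketch below the PAV solution for a hypothetical label sequence is mixed with a strictly rising ramp. The ramp $t/(T+1)$ and the mixing form are assumptions chosen for this example, since the text only requires an arbitrarily small strictly monotonic perturbation. The Brier rule is used as one example of a proper scoring rule, with unit weights.

```python
def brier_objective(labels, probs):
    """Un-weighted Brier objective (1 = theta_1, 0 = theta_2)."""
    return sum(
        (1.0 - p) ** 2 if label == 1 else p ** 2
        for label, p in zip(labels, probs)
    )


def perturb(probs, eps):
    """Mix the solution with a strictly rising ramp t / (T + 1), t = 1..T."""
    size = len(probs)
    return [
        (1.0 - eps) * p + eps * t / (size + 1)
        for t, p in enumerate(probs, start=1)
    ]


# Hypothetical labels and the corresponding un-weighted PAV solution.
labels = [0, 1, 0, 1, 1, 0, 1]
pav_solution = [0.0, 0.5, 0.5, 2 / 3, 2 / 3, 2 / 3, 1.0]


def gap(eps):
    """Objective excess of the perturbed solution over the PAV solution."""
    perturbed = perturb(pav_solution, eps)
    return brier_objective(labels, perturbed) - brier_objective(labels, pav_solution)


if __name__ == "__main__":
    for eps in (1e-1, 1e-2, 1e-3):
        print(eps, gap(eps))
```

For every positive mixing factor the perturbed sequence is strictly increasing and therefore invertible, while its objective exceeds the PAV objective by an amount that shrinks as the factor goes to zero.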

In our own work on calibration of speaker recognition log-likelihood-ratios [5], we
have chosen to use strictly monotonic rising parametric calibration transformations, 
rather than PAV. However, we then do use the PAV calibration transformation in 
the supporting role of evaluating how well our parametric calibration strategies work. 
In this role, the PAV forms a well-defined reference against which other calibration 
strategies can be compared, since it is the best possible monotonic transformation that 
can be found on a given set of supervised evaluation data. It is in this evaluation role, 
that we consider the optimality properties of the PAV to be particularly important. 

For details on how we employ PAV as an evaluation tool (footnote 14), see [21].

Acknowledgments. We wish to thank Daniel Ramos for hours of discussing 
PAV and calibration, and without whose enthusiastic support this paper would not 
have been written. 

Appendix A. Note on RBPSR family. Some notes follow, to place our
definition of the RBPSR family, as defined in §1.2, in context of previous work. Our
regularity condition (i), directly below (1), is adapted from [TTJ \W\ . General families
of binary proper scoring rules have been represented in a variety of ways (see [16]
and references therein), including also integral representations that are very
similar (but not identical in form) to our (1). See for example [13], where the form
$\int_q^1 \rho'(\eta)\, d\eta$, $\int_0^q \frac{\eta}{1-\eta}\, \rho'(\eta)\, d\eta$ was used; or 0QH] where $\int_q^1 (1-\eta)\, \rho''(\eta)\, d\eta$, $\int_0^q \eta\, \rho''(\eta)\, d\eta$
was used. Equivalence to (1) is established by letting $\rho'(\eta) = \frac{\rho(\eta)}{\eta}$ and $\rho''(\eta) = \frac{\rho(\eta)}{\eta(1-\eta)}$.
The advantage of the form (1) which we adopt here, is that the weighting function
$\rho(\eta)$ is always in the form of a normalized probability density, which gives the natural
interpretation of expectation to these integrals.

The reader may notice that it is easy (e.g. by applying an affine transform to (1))
to find a binary proper scoring rule which satisfies the properties of lemma 6 but
which is not in the family defined by (1). There are however equivalence classes
of proper scoring rules, where the members of a class are all equivalent for making
minimum-expected-cost Bayes decisions [TH [TT] . Elimination of this redundancy
allows normalization of arbitrary proper scoring rules in such a way that the family (1)
becomes representative for the members of these equivalence classes [7].